I have dug into this some more and found that the performance problem goes away when setting "relatime=on" or "atime=off" for the dataset.
[...]
ADDON (after finding out more, see below):
the performance problem also goes away when setting the preallocation policy to "off" on the datastore.
Many, many thanks for that one!
It had been driving me mad for a week...
I made a mistake last week by doing two different upgrades at the same time: I upgraded PVE 7 to 8 and TrueNAS 12 to 13. Everything went OK (except a loss of connectivity with ifupdown2; perhaps I forgot to read some release notes, my bad).
But the nightly backup job was hell: by morning some VMs were stuck with I/O speeds from the 9600-baud modem era. For example, a VM that took 25 minutes to back up on PVE 7 took 1h48 the day after on PVE 8!
I have two PVE clusters and three TrueNAS boxes sharing qcow2 files over NFS, and it was a real mystery...
For two days I tried many things without success: mitigations=off, iothread=1, VirtIO SCSI single, even MTU tweaking and cross-mounting between the two clusters and the three TrueNAS boxes, but none of the tests gave logical results.
iperf3 reports maximum bandwidth, and with dd on the PVE host or inside a VM throughput is also at the top, but moving a disk (offline) never ends.
I was moving a 64 GB disk when I found your post: in 5 hours it had moved 33%. I went onto the TrueNAS, set
atime=off, and the remaining 67% moved in a few minutes!
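For reference, the atime change can be made from the TrueNAS shell with the standard ZFS commands (the pool/dataset name below is a placeholder; adjust it to your own layout):

```shell
# Disable atime updates entirely on the dataset backing the NFS share
# ("tank/vmstore" is a placeholder -- use your actual pool/dataset).
zfs set atime=off tank/vmstore

# Softer alternative: keep atime, but only update it when it is older
# than a day or when mtime/ctime also change.
zfs set relatime=on tank/vmstore

# Verify the current settings
zfs get atime,relatime tank/vmstore
```

Either setting avoids a metadata write on every read, which is what appears to hurt the NFS workload here.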
I added the
preallocation off option on all my NFS datastores and launched a backup, which seems to run much faster than before.
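On the PVE side this can be set per storage, either with pvesm or directly in /etc/pve/storage.cfg (the storage ID and paths below are placeholders):

```shell
# Set the preallocation policy to "off" for an NFS storage
# ("nfs-truenas" is a placeholder storage ID -- use your own).
pvesm set nfs-truenas --preallocation off

# The equivalent stanza in /etc/pve/storage.cfg would look like:
#   nfs: nfs-truenas
#           server 192.168.1.10
#           export /mnt/tank/vmstore
#           content images
#           preallocation off
```

With preallocation off, qcow2 images are created without falloc/full preallocation, which avoids the long allocation writes over NFS.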
I am not sure that all the regressions are gone, but it is far better for the moment... I'll keep checking carefully over the next days.